Unit 3 topics (that we’ve covered thus far)

Week 10 - Sampling distributions and confidence intervals

Effect size

Confidence intervals provide an (interval) estimate for the effect of interest. Hence, confidence intervals, and not hypothesis tests, can inform us about the effect size. This makes it easier for us to compare the results of our statistical analysis to a practical understanding of what kind of “effect” would be important in a particular problem setting.

Practical significance is not determined by statistical significance! Statistical significance is not determined by practical significance!
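To see how the two kinds of significance can come apart, here is a minimal sketch in R. All of the numbers (a sample of one million, a sample mean of 0.10 dollars, a standard deviation of 20) are hypothetical and chosen only for illustration; they are not course data.

```r
# Hypothetical summary statistics (assumptions, not from the notes).
n    <- 1e6    # very large sample size
xbar <- 0.10   # observed mean effect, in dollars
s    <- 20     # sample standard deviation, in dollars

se     <- s / sqrt(n)             # standard error of the sample mean
t_star <- qt(0.975, df = n - 1)   # critical value for 95% confidence
ci     <- xbar + c(-1, 1) * t_star * se

ci
```

The interval excludes 0, so the effect is statistically significant, yet every value in the interval is well under a dollar, which may be too small to matter in practice.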

Week 11 - Hypothesis tests for an unknown proportion or mean

Hypotheses are typically designed so that what we want to prove is expressed in the alternative. For all of the methods that we’ve covered thus far, the null hypothesis is always going to be of the form \[H_0: \text{<parameter> } = \text{ some number}\]

Types of conclusions

The only way to reduce both types of error is to collect more evidence or, in statistical terms, to collect more data.

  • \(\alpha = Pr(\text{Type I error})\): If \(H_0\) is true, this is the probability that we (incorrectly) reject it.

  • \(\beta = Pr(\text{Type II error})\): If \(H_0\) is false, this is the probability that we (incorrectly) fail to reject it.

  • \(1-\beta = \text{Power}\): If \(H_0\) is false, this is the probability that we (correctly) reject it.
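The link between sample size and these probabilities can be sketched with `power.t.test()` from R's built-in stats package. The true shift (`delta = 5`) and standard deviation (`sd = 20`) below are hypothetical values picked for illustration only.

```r
# Hypothetical scenario (assumptions, not from the notes):
# H0 is false, the true mean differs from the null value by 5, and sd = 20.
alpha <- 0.05   # Pr(Type I error) is fixed by the analyst

power_small <- power.t.test(n = 20, delta = 5, sd = 20, sig.level = alpha,
                            type = "one.sample",
                            alternative = "two.sided")$power
power_large <- power.t.test(n = 200, delta = 5, sd = 20, sig.level = alpha,
                            type = "one.sample",
                            alternative = "two.sided")$power

# beta = Pr(Type II error) = 1 - power
beta_small <- 1 - power_small
beta_large <- 1 - power_large

c(beta_small = beta_small, beta_large = beta_large)
```

With \(\alpha\) held at 0.05, the larger sample has the smaller \(\beta\), which is the sense in which more data reduces the chance of error.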

The logic of hypothesis tests is similar to the logic behind inter-universe travel in the movie Everything Everywhere All at Once…

Week 12 - Inference from two samples (grouped data)

Example: Confidence interval for a difference in means (from Week 12)

On average, how much more money do consumers spend at Target compared to Walmart?

Suppose researchers collected a systematic sample from \(85\) Walmart customers and \(80\) Target customers by asking them for their purchase amount as they left the stores. The data they collected is summarized in the table below. Suppose a computer already calculated the degrees of freedom to be \(162.75\).

Statistic     Walmart    Target
\(\bar{x}\)   \(\$45\)   \(\$53\)
\(s\)         \(\$21\)   \(\$19\)
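The notes do not say how the computer obtained \(162.75\) degrees of freedom. A likely candidate is the Welch-Satterthwaite approximation, which R's `t.test()` uses by default for two samples; the sketch below assumes that formula and uses variable names invented here for illustration.

```r
# Summary statistics from the table (group 1 = Target, group 2 = Walmart).
s1 <- 19; n1 <- 80
s2 <- 21; n2 <- 85

v1 <- s1^2 / n1   # estimated variance of the Target sample mean
v2 <- s2^2 / n2   # estimated variance of the Walmart sample mean

# Welch-Satterthwaite approximate degrees of freedom (assumed method).
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

round(df_welch, 2)
```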

Step 1) Identify and define the population parameter and choose your confidence level.

Step 2) Calculate the sample estimate for the population parameter.

Step 3) Assess the required assumptions and conditions.

Step 4) Find the critical value corresponding to your confidence level.

Step 5) Calculate the standard error of your sample estimate.

Step 6) Calculate the lower and upper bounds of your confidence interval.


Example: Confidence interval for a mean difference of paired data (from Week 13)

On average, how large is the difference in car insurance prices for customers of an online insurance company versus customers of a local insurance company?

Find a \(95\%\) confidence interval for the mean difference in insurance prices based on the data given below. The data below represents randomly selected insurance profiles (type of car, coverage, driving record, etc.) for 10 clients at a local provider and the corresponding quote from another online provider given their policy information.

mean(insurance_diff$PriceDiff)
## [1] 45.9
sd(insurance_diff$PriceDiff)
## [1] 175.6628

Looking ahead


Partial Solutions

Example: Confidence interval for a difference in means (from Week 12)

Step 1) \(\mu_1 - \mu_2 =\) mean amount spent at Target minus mean amount spent at Walmart. We’ll use a 95% confidence level.

Step 2) \(\bar{x}_1 - \bar{x}_2 = \$53 - \$45 = \$8\)

Step 3) Assess the required assumptions and conditions - done in class.

Step 4) We need the critical \(t^*\) value corresponding to a 0.95 confidence level from a Student’s t distribution with \(162.75\) degrees of freedom. We can find this exactly using R and this value should be similar to the approximate critical value which you can read off the t-table. Because we ask R for the lower-tail 0.025 quantile, it returns the negative of the critical value, so \(t^* \approx 1.975\).

qt(0.025, df = 162.75, lower.tail=TRUE)
## [1] -1.974647

Step 5) \(SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{19^2}{80} + \frac{21^2}{85}} = 3.115\)

Step 6) \(8 \pm (1.975 \times 3.115) = [\$1.848, \$14.152]\) with interpretation given in class.
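Steps 2, 4, 5 and 6 can be reproduced from the summary statistics alone. This is a minimal sketch with variable names chosen here for illustration; it takes the degrees of freedom as given in the problem statement.

```r
# Summary statistics from the problem statement.
xbar_target  <- 53; s_target  <- 19; n_target  <- 80
xbar_walmart <- 45; s_walmart <- 21; n_walmart <- 85
df <- 162.75   # degrees of freedom, as given

estimate <- xbar_target - xbar_walmart                           # Step 2
t_star   <- qt(0.975, df = df)                                   # Step 4
se       <- sqrt(s_target^2 / n_target + s_walmart^2 / n_walmart)  # Step 5
ci       <- estimate + c(-1, 1) * t_star * se                    # Step 6

round(c(se = se, t_star = t_star, lower = ci[1], upper = ci[2]), 3)
```

Because R carries unrounded values through every step, the bounds can differ from the hand calculation in the third decimal place.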


Example: Confidence interval for a mean difference of paired data (from Week 13)

Step 1) Identify and define the population parameter and choose your confidence level.

\(\mu_{Diff} =\) the mean difference in insurance prices between online and local providers (local minus online)

Let’s use a 90% confidence level to mix things up.

Step 2) Calculate the sample estimate for the population parameter.

\(\bar{d} = \$45.9\)

Step 3) Assess the required assumptions and conditions.

  • Independence

    • 10% condition

    • Randomization condition

  • Sample size (or nearly Normal) condition

The sample is representative of the local insurance company’s customers because these 10 profiles were randomly selected. It’s not clear how large the local insurance company is, but it’s very likely that the company has more than 100 customers, so the 10% condition is met. Therefore, there isn’t any strong indication that the differences are not independent (i.e., we can assume within-sample independence). However, the sample size is rather small, so in order to use the CLT we need to check a histogram of the difference data. This histogram is symmetric and unimodal, so it seems reasonable that the larger population of all possible price differences for customers of this local company is approximately Normally distributed. There aren’t any major red flags against any of the necessary assumptions for this method.

Step 4) Find the critical value corresponding to your confidence level.

\(t^*_{0.90, df=10-1}=1.833\) (note this is also the value you’d find using the t-table). R returns the corresponding lower-tail quantile, which is the negative of the critical value:

## [1] -1.833113

Step 5) Calculate the standard error of your sample estimate.

\(SE(\bar{d}) = \frac{175.66}{\sqrt{10}} = 55.549\)

Step 6) Calculate the lower and upper bounds of your confidence interval.

\(45.9 \pm \left(1.833 \times 55.549 \right) = [-55.928,147.728]\)

Thus, we are \(90\%\) confident that the true mean difference in insurance prices between this online and this local provider (local minus online) is between -$55.93 and $147.73. In other words, on average the local provider is anywhere from $55.93 cheaper to $147.73 more expensive than the online provider.

We can check our answer in R using the following code:

t.test(insurance_diff$PriceDiff, conf.level = 0.90)

## 
##  One Sample t-test
## 
## data:  insurance_diff$PriceDiff
## t = 0.82629, df = 9, p-value = 0.43
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  -55.92845 147.72845
## sample estimates:
## mean of x 
##      45.9
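The raw differences are not listed in these notes, but the interval, test statistic and p-value in the output above can also be rebuilt from the reported mean and standard deviation. The sketch below does this; the variable names are chosen here for illustration.

```r
# Summary statistics reported earlier for insurance_diff$PriceDiff.
d_bar <- 45.9       # sample mean difference (local minus online)
s_d   <- 175.6628   # sample standard deviation of the differences
n     <- 10         # number of paired profiles

se     <- s_d / sqrt(n)             # Step 5: standard error
t_star <- qt(0.95, df = n - 1)      # Step 4: 90% confidence, upper tail
ci     <- d_bar + c(-1, 1) * t_star * se   # Step 6: interval bounds

# Test statistic and two-sided p-value for H0: mean difference = 0.
t_stat  <- d_bar / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(c(lower = ci[1], upper = ci[2], t = t_stat, p = p_value), 3)
```

Using `qt(0.95, ...)` asks for the upper-tail quantile directly, so the critical value comes back positive.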